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Abstract 

Background: Simple peak-picking algorithms, such as those based on lineshape fitting, perform well when peaks 
are completely resolved in multidimensional NMR spectra, but often produce wrong intensities and frequencies for 
overlapping peak clusters. For example, NOESY-type spectra have considerable overlaps leading to significant 
peak-picking intensity errors, which can result in erroneous structural restraints. Precise frequencies are critical for 
unambiguous resonance assignments. 

Results: To alleviate this problem, a more sophisticated peaks decomposition algorithm, based on non-negative 
matrix factorization (NMF), was developed. We produce peak shapes from Fourier-transformed NMR spectra. Apart 
from its main goal of deriving components from spectra and producing peak lists automatically, the NMF approach 
can also be applied if the positions of some peaks are known a priori, e.g. from consistently referenced spectral 
dimensions of other experiments. 

Conclusions: Application of the NMF algorithm to a three-dimensional peak list of the 23 kDa bi-domain section of 
the RcsD protein (RcsD-ABL-HPt, residues 688-890) as well as to synthetic HSQC data shows that peaks can be 
picked accurately also in spectral regions with strong overlap. 
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Background 

The precise estimation of the frequencies of peaks in nu- 
clear magnetic resonance (NMR) spectra is often com- 
pUcated by poor signal-to-noise ratio and peak overlap. 
This results in only partially complete and correct peak 
picking. The problem aggravates especially when the 
peaks are highly overlapped. This is compounded by 
combinatorial ambiguity problems for resonance assign- 
ments and increases errors in NOE distance restraints 
[1]. To alleviate this problem, a more sophisticated peak 
decomposition algorithm, based on non-negative matrix 
factorization (NMF), has been developed and applied to 
three-dimensional (3D) NMR spectra. 

Non-negative Matrix Factorization was first intro- 
duced by Paatero and Tapper as the concept of positive 
matrix factorization [2,3] for estimating errors in widely 



* Correspondence: guentert@em.uni-frankfurt.de 
^Institute of Biophysical Chemistry, Center for Biomolecular Magnetic 
Resonance, and Frankfurt Institute of Advanced Studies, Goethe University 
Frankfurt am Main, Max-von-Laue-Str. 9, 60438 Frankfurt am Main, Germany 
^Graduate School of Science and Engineering, Tokyo Metropolitan University, 
Hachioji, Tokyo 192-0397, Japan 

(3 BioMed Central 



varying environmental data. Their work revealed the non- 
negativity features of the underlying data models. Lee and 
Seung [4,5] showed using an effective multiplicative algo- 
rithm parts-based representation of an object using NMF 
approach. A recent in-depth review on NMF algorithms 
discusses many forms of factorizations [6]. Because of the 
non-negativity and the sparseness constraints [7], NMF 
has wide applications in multidimensional data analysis 
[8-15]. The idea originated from the fact that in certain 
applications, by the rules of physics, the data quantities 
cannot be negative. The NMF approach was reported in 
application to complex metabolomic mixture analysis in 
two-dimensional NMR spectra [16]. Higher dimensional 
NMR spectral data matrices can be decomposed using 
NMF algorithms. The important property of NMF is the 
non-negative nature of the decomposed factors. There- 
fore, NMF processing of higher dimensional NMR spec- 
tral data can have important consequences in automated 
data processing. 

In the automated peak picking approach peak identifi- 
cation is followed by the estimation of peak intensities 
and frequencies. Several algorithms have already been 
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developed to perform peak picking in NMR spectra. 
Most of these algorithms are based on finding local 
maxima that fulfill certain criteria, and/or use Gaussian 
or Lorentzian lineshapes for lineshape fitting [17-21] by 
minimizing the residual squared error between the ob- 
served peak shape and the assumed lineshape. Apart 
from the lineshape fitting methods, PICKY [22] is an- 
other program that uses a singular value decomposition 
of peak components for peak picking. In general, highly 
overlapped peaks cause the most commonly observed 
problems of existing peak picking algorithms. We used 
the basic two-dimensional (2D) NMF model, extended 
sequentially to ND usage to decompose a 3D data (sig- 
nal) tensor. A Euclidean distance cost function was used 
as a measure of factorization convergence. The approach 
allows applying constraints if some information is 
known a priori, e.g. the total number of peaks or posi- 
tions in common dimensions of hyper-dimensional (HD) 
shapes [23]. We discuss the NMF algorithm, the condi- 
tions for unique solutions of NMF models, and its appli- 
cations to decompose 3D NMR signal tensors. 

Methods 

NMF decomposition algorithm 

The basic idea of spectral factorization is to represent 
the multidimensional NMR spectrum as well as possible 
by a sum of direct products of one-dimensional shapes. 
The latter are expected to represent the lineshapes of res- 
onances. In this way, i.e. if each one-dimensional shape 
represents a resonance line, possible overlap is deconvo- 
luted and the factorization of the spectrum is equivalent 
to peak picking. The exact peak positions can simply be 
obtained by determining the (interpolated) maxima within 
the one-dimensional shapes. If it is known that the real 
signals in a spectrum are non-negative, a better result can 
be expected by introducing this condition into the 
factorization algorithm. 

The non-negative factorization (NMF) problem may 
be described as follows. Given the observed data matrix 
Y= [y{l),y{2), ...,3/(1)] ei?^^^ with Y>0, The solution is 
to find two matrices with only non-negative elements, 
the basis or mixing matrix A g R'^ ^ and the source 
matrix X= [x{l),x{2), .,,,x{T)] gR^"" ^, where r represents 
the number of true components [24]. The source matrix 
is expected to produce the unknown latent components 
of the original data matrix Y, The problem is to factorize 
the given data matrix such that it minimizes the squared 
Euclidean distance between the observed data matrix 
and the product of two non-negative data matrices i.e. 

Y =AX + N, 

with A>0 and X> 0, where NeR^^-^ is a noise or error 
matrix that is to be minimized. 



The divergence cost function is expressed in terms of 
the squared Euclidean distance given as 

Df{A,X) = \\Y-AXf 

The objective is to minimize the divergence of this func- 
tion using a standard gradient descent technique. The di- 
vergence is calculated by component-wise calculation of 
the distance between matrixes Y and AX. The minimization 
is achieved using multiplicative update rules that update 
the matrices A and X iteratively until a minimum squared 
Euclidean distance is reached. The updates may be per- 
formed until as much minimum possible distance lead- 
ing to a nonnegative solution is achieved. Any increase 
in the squared Euclidean distance may lead to an incor- 
rect solution. 

In matrix notation, the multiplicative update rules 
become 

A^AQYX^0AXX^ 

X^XQA^Y0A^AX 

where 0 and 0 represent component-wise multiplica- 
tion and division, respectively [25]. The proposed multi- 
plicative update rules were originally introduced in the 
image space reconstruction algorithm (ISRA) [26]. The 
algorithm performs minimization of the squared Euclid- 
ean distance cost function using a gradient descent tech- 
nique. The technique uses alternate switching between 
sets of parameters to generate updates on the matrices A 
and X until convergence is reached. The original ISRA 
algorithm used multiplicative updates rules by updating 
only the matrix X iteratively and assuming the matrix A 
to be known [25,26]. The convergence to a nonnegative 
solution is obtained for any positive starting point given 
that the original input matrices contain hidden source 
components [27]. 

Extension to 3D non-negative tensor factorization 

The 3D non-negative tensor factorization (NTF) model 
may be defined as an extension of the basic 2D models. 
Some 3D NTF models can be solved using the basic 2D 
NMF models referred to as the NTF2 model [24]. The 
model is illustrated in Figure 1 and is described as 
follows. 

Y, = AD,X^ + TV, = AX, + N,, = 1, Q) 

where Yq = \yitq\ e R^"^^ is a q-\h frontal slice (matrix) of 
the observed 3D data (signal) tensor YgI^'^^'^^, The com- 
ponent matrices A = [aij] = [uiAii '"■>cij\ el^"^^ is a mixing 
or basis matrix and X, ' = [Xj^,] e R^'^^gives unknown nor- 
malized hidden components in q-th. slice. The matrix 
Xq = ' = [Xjtq\ e R^"^^ represents re-normalized source 
components and Nq = [nitq\ e R!"^^ represents the q-th. 
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Figure 1 Illustration of the NTF2 model for peak picking a 3D NMR spectrum. The matrix / is a 3D input NMR spectral data matrix. Aq and 
Xq are 2D matrices factorized from q^^ data plane of matrix Y representing the basis matrix and the source component matrix, respectively. N is 
the 3D matrix representing the residual error of the non-negative matrix factorization. 



frontal slice of the tensor NeR^'''^''^ representing 
noise in the input matrix Y. 

Determining the true number of components 

For the NMF/NTF models the true number of compo- 
nents r plays an important role in reaching convergence 
because an approximate valid model is instrumental in 
capturing the true underlying structure in the data. The 
true number of components may be obtained using sev- 
eral approximate and heuristic techniques [28-32]. Dif- 
ferent numbers of components may result in different 
residual minima. In the present work, we applied the fol- 
lowing procedure for calculating the true number of 
components. We observed the decay of residual values 
for different number of components r = 2, 3, 4, 5, 6, and 
7. The NMF iterations were stopped when the residual 
showed no improvement over 10 or more consecutive it- 
erations. The true number of components was obtained 
as the one that showed the minimum residual value. It 
was also observed that using a higher number of compo- 
nents than the true number did not yield better mini- 
mum residual values. 

Uniqueness conditions for the 2D NMF ambiguity 
problem 

The quadratic cost function with respect to matrices A 
and X may have many local minima, which leads to rota- 
tional ambiguity of the factorized matrices [33] . Therefore 
the alternating minimization of the squared Euclidean cost 
function may not result in a unique NMF solution. How- 
ever, applying some preprocessing or filtering of the input 
matrix is sufficient to solve the NMF problem uniquely. 
The preprocessing involves normalization of the input 
matrix or normalizing the columns of the factorized 
matrix A and/or the rows of matrix X We normalized 
each column of the matrix A to unit /i-norm. In addition, 
we normalized the input matrix Y to unit length at the 



beginning of the factorization and later used the corre- 
sponding scaling factor to obtain the original intensity 
values of the component peaks. 

Peak picking of 3D NMR spectral data 

The 3D NMF decomposition is performed as an ex- 
tended 2D NMF decomposition as described above. 
Each slice of 2D data, taken from a 3D spectral data 
matrix Y, occupies data in the matrix Y^ representing the 
q^^ point in the third dimension. The model factorizes 
peak components in one-dimensional (ID) peak shapes 
in the source matrix X^, Peak positions are obtained by 
fitting an ideal Gaussian shape of average linewidth 
to the observed component by minimizing the scalar 
product between the Gaussian shape and the observed 
component. Next, the linewidth of the peak is adapted 
to obtain an optimal agreement. The final peak positions 
are obtained by performing a three-point parabolic 
interpolation. The final peak lists are obtained by apply- 
ing a user-defined intensity threshold. 

Spectral data sets 

The NTF2 model was applied using the 2D NMF algo- 
rithm to the previously measured 3D HNCO spectrum for 
the 23 kDa RcsD-ABL-HPt protein construct (residues 
688-890) [34]. The spectrum was measured on a Bruker 
AVANCE spectrometer operating at Larmor frequency 
of 950 MHz. The numbers of time-domain complex data 
points were 128 and 90 in ^^C and ^^N indirect dimen- 
sions respectively. Non-uniform sampling schedule was 
employed in both the indirect dimensions at a level of 
sparseness of 10 per cent, which acquired 1152 FIDs. The 
203 amino acid protein gave rise to 195 expected peaks in 
the 3D HNCO experiment [35]. 

Additionally, the method was applied to a group of 
four synthetic signals with known positions. The syn- 
thetic signals were of different intensities and were used 
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to construct a 2D HSQC spectrum. The efficiency of the 
algorithm for peak picking was assessed with different 
levels of noise in the spectrum. The noise was added in- 
crementally in steps of 10 percent each. The separation 
of peak positions was varied from 10 to 1 points in steps 
of one data point each. 

Results and discussion 

Figure 2 shows peaks picked from the 3D HNCO 
spectrum of the RcsD-ABL-HPt protein on a ^H-^^C 
projection. The model was able to pick 201 backbone 
and side-chain cross peaks from the HNCO spectrum. 
The peak list of 201 peaks is given in Additional file 1: 
Table SI. The NTF2 decomposition of a small region of 
about eight overlapped peaks of the 3D HNCO experi- 
ment is shown in Figure 3. The peak shapes determined 
for each dimension are plotted below each 2D projec- 
tion. Peak shapes appearing in the same color in each 
2D projection define one peak in the 3D spectrum. 
Table 1 lists the eight peaks picked in this overlap region 
with their assignments. 

It can be seen that the overlapped peaks were well 
decomposed in all three dimensions of the HNCO experi- 
ment. Among the 201 peaks picked from the HNCO 
spectrum there were about 23 peaks that were overlapped 
in one or more dimensions. The NTF2 model was able to 
decompose all the overlapped peaks. Table 1 shows two 
peaks assigned to Cysteine 860 and to Leucine 859 which 



are overlapped in all three dimensions. The peaks for 
these two residues were well decomposed in all three di- 
mensions as shown in Figure 3. This shows that correct 
peak picking of 3D NMR spectral data especially in 
overlapped regions is possible using the NTF2 model. 

The HSQC spectrum built using four synthetic signals 
is shown in Figure 4A. The results of peak picking show 
that the algorithm could tolerate up to 60% noise in the 
spectrum when the peaks were separated by at least 7 
points from each other. As the peaks were moved closer 
to each other by one point at each step, the noise toler- 
ance started to drop. The result is shown in Figure 4B. 
When peaks were closer than two points, even 10% noise 
in the spectrum generated incorrect peak intensities. 

The change in peak position of the peak with the highest 
intensity with varying levels of noise in the spectrum is 
plotted in Figure 4C. The result shows that the algorithm 
could tolerate up to 70% noise in the spectrum for factor- 
izing the peak shape at its true position. It may be noted 
that the known peak position is being observed for its 
change with increasing noise in the spectrum. Therefore, 
with higher than 70% of noise in the spectrum, the peak 
appeared at its true position albeit with incorrect line- 
width and intensity. Many other peaks resembled the true 
peak in the factorized ID components (Matrix X^). There- 
fore, it may be concluded that the algorithm could not re- 
liably determine the true peak when more than 70% noise 
were present in the spectrum. Blue points indicate that 
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Figure 2 ^H-^^C projection of thie 3D HNCO NMR spectrum of RcsD-ABL-HPt (DAH) construct shiowing tiie pealcs piclced using the NTF2 


model. Picked peaks are marked by a cross at the peak center. 





Tikole et al. BMC Bioinformatics 2014, 15:46 
http://www.bionnedcentral.conn/1471 -21 05/1 5/46 



Page 5 of 7 




ppm(^H) ppm(l5N) ppmC^C) 



Figure 3 Non-negative matrix factorization of overlapped pealcs of a small region (^H: 8.095-8.298ppm, ^^N: 1 18.81 3-1 23.806ppm, 
^^C: 175.91 0-1 76.791 ppm) of the 3D HNCO spectrum of the RcsD-ABL-HPt (DAH) protein. A) H-^^C projection. B) H-^^N projection. 
C) ^^C-^^N projection. Tlie upper row sliows tine peal<s in 2D projections. Tlie lower row sliows tine peal<s factorized in ID sliapes from tine 
corresponding projections. Tine peal<s positions and intensities were obtained using a tliree-point parabolic interpolation. 



the algorithm was able to distinguish the peak from noise. 
Red points indicate that many other peaks resembled the 
true peak because of higher noise in the spectrum. Be- 
cause the original position of the peak was known, the dif- 
ference in peak position could still be plotted in Figure 4C 
at noise levels higher than 70%. However, the matrix 
factorization residual and the number of peaks picked in 
each component were at unacceptable levels. 



Table 1 Peak list obtained by applying the NTF2 model 
to the overlapped region of the 3D HNCO spectrum of 
the RcsD-ABL-HPt construct shown in Figure 3 





Pealc position (ppm) 


Peak 


Peak 








intensity 


assignment 


8.275 


122.868 


1 76.380 


8.013 X 10^ 


C860 


8.184 


122.890 


1 76.380 


1.434 X 10^° 


E861 


8.271 


122.861 


1 76.420 


2.011 X 10^ 


L859 


8.142 


122.732 


176.018 


9.192 X 10^ 


R824 


8.241 


121.000 


176.412 


3.430 X 10^ 


R689 


8.167 


122.246 


1 76.243 


9.482 X lO"" 


Q858 


8.245 


119.853 


176.731 


2.509 X 10^ 


R726 


8.219 


119.612 


1 76.367 


1.157 X 10^° 


T840 



Peak assignments are shown in the last column. 



Similarly, the effect of noise on the change in peak in- 
tensity was observed on the peak selected for Figure 4C. 
Figure 4D shows the change in peak intensity with in- 
creasing level of noise in the spectrum. The algorithm 
could tolerate up to 70% noise in the spectrum for cor- 
rect peak picking. The peak may be accepted or rejected 
at a different intensity depending on the user intensity 
tolerance threshold with higher noise levels in a 
spectrum. With more than 70% noise in the spectrum, 
incorrect intensities and linewidths were obtained after 
factorization. 

In general, the method works well on overlapped 
peaks for two different reasons. First, the 3D peak pick- 
ing data is reduced to a 2D peak picking data matrix. 
The 2D factorization is performed at all points in the 
third dimension. In case of overlapped peaks, for ex- 
ample, if two peaks are separated by only one or two 
points, the factorization can give two peaks separately in 
each data plane from the points that separate the two 
peaks. This becomes highly improbable in case of peak 
picking based on lineshape fitting. Second, if the peak 
position from one of the dimensions is known a priori 
then the 2D data matrix comprising unknown peak pos- 
ition can be factorized to get the peak position in the 



Tikole et al. BMC Bioinformatics 2014, 15:46 
http://www.bionnedcentral.conn/1 471 -21 05/1 5/46 



Page 6 of 7 





0 20 40 60 80 100 0 20 40 60 80 100 

Noise (%) Noise (%) 



Figure 4 Noise and pealc overlap tolerance of the NTF2 model for peak factorization of a synthetic HSQC spectrum. A) HSQC spectrum 
constructed using four synthetic signals. B) Effects of noise and peak overlap on peak picking. The amount of noise in the spectrum is shown on 
the X-axis. The y-axis shows the peak separation in number of points. Red circles indicate that peaks were incorrectly picked. Blue circles show 
that the peaks were correctly picked upon factorization. C) Effects of the amount of noise on the peak position determination. Differences in the 
peak position in number of points are shown on the y-axis. The x-axis shows the amount of noise in the spectrum. Blue points indicate that the 
peak was correctly picked. Red points indicate that the peak was obtained with incorrect parameters because of higher noise in the spectrum. 
D) Effects of the amount of noise on the peak intensity: The x-axis shows the amount of noise present in the spectrum. Differences in the peak 
intensity of the peak are shown on the y-axis. Blue points show that the peak was distinguishable from the noise. Red points show that the peak 
was obtained with incorrect parameters upon factorization. 

V J 



other dimension(s). The method has useful consequences 
when peaks are to be picked from hyper-dimensional data 
matrices as discussed below. 

The usefulness of the NTF2 model becomes apparent 
when multi-dimensional NMR spectra have commonly 
referenced dimensions. Thus, if the peak positions from 
an already measured more sensitive NMR experiment 
such as 3D HNCO are known, then the same peak posi- 
tions can be used to pick peaks in other spectra that 
have dimensions in common with HNCO, for example, 
and/or ^^N. Peak picking overlapped regions becomes 
easier with the NTF2 model if each common dimension 
has consistent spectral referencing such as the spectral 
width, the carrier frequency positions and the number of 
sampled points. Theory suggests that the components in 
matrix A are assumed known and the matrix X gives the 



hidden source components of the input matrix Y [25]. 
The advantage that the NTF2 model offers in decompos- 
ing a 3D NMR spectral data tensor is that peak positions 
from any two dimensions can be assumed known. There- 
fore, the reduced NTF2 model can offer accurate peak 
position solutions especially when the peaks are over- 
lapped. This notion can be extended naturally to hyper- 
dimensional NMR spectral data tensors [23]. If peak 
positions from one or more dimensions are known, the 
peaks can be picked in the remaining dimensions by ap- 
propriately selecting the corresponding 2D data planes 
containing the picked peaks from the 3D data tensors. 

Conclusion 

We have developed a method based on non-negative 
matrix factorization that can be used for peak picking 
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3D NMR spectral data tensors. Our results demonstrate 
that the method is particularly useful for picking over- 
lapped peaks. Additionally the method can be easily ex- 
tended for peak picking three- or higher-dimensional 
NMR spectral data tensors that have commonly refer- 
enced dimensions. 

Additional file 



Additional file 1: Table SI. Peak list obtained by applying the NTF2 
model to the entire 3D HNCO spectrum of the RcsD-ABL-HPt construct 
shown in Figure 2. 
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